1 Abstract

In this research we use a decentralized computing approach to allocate and 
schedule tasks on a massively distributed grid. Using emergent properties of 
multi-agent systems, the algorithm dynamically creates and dissociates clusters 
to serve the changing resource demands of a global task queue. The algorithm is 
compared to a standard first-in first-out (FIFO) scheduling algorithm.
Experiments done on a simulator show that the distributed resource allocation
protocol (dRAP) algorithm outperforms the FIFO scheduling algorithm on time to
empty queue, average waiting time and CPU utilization. Such a decentralized
computing approach holds promise for massively distributed processing scenarios like
SETI@home and Google MapReduce. 

2 Introduction 

Recent years have seen a trend of moving large computational tasks to
collections of inexpensive, commercial off-the-shelf (COTS) computers that are
geographically distributed. This has contributed significantly to the
advancement of science by providing access to large-scale shared computing
resources on which to solve computationally expensive problems. Some common
examples are SETI@home [1], which runs tasks on millions of computers worldwide,
and Google MapReduce [5], which distributes calculation of web-crawled metrics
among thousands of computers. This move towards distributed computing has
created a need for efficient task allocation and scheduling algorithms. Such
algorithms should be highly scalable, since these systems typically have
thousands to millions of computers. They should also be robust to single-point
failures and adaptive to task demand. Recent research on grid resource
allocation has focused on volunteer resource allocation, agreement-based
resource allocation and economic resource allocation. Multi-agent decentralized
systems offer an exciting approach to distributed resource allocation. They
have emergent global properties which arise from local interactions and have
previously been used to model biological phenomena and solve real-world
problems. Here we use such a decentralized computing approach to allocate and
schedule tasks on a grid. The remainder of the paper is organized as follows:
Section 3 formalizes the problem and states the assumptions, Section 4 briefly
reviews decentralized computing and the advantages it can afford to a
distributed allocation problem, Section 5 introduces multi-agent systems,
Section 6 introduces the simulator used for the experiments in this paper,
Section 7 discusses the dRAP algorithm, Section 8 analyzes the cost of
searching through the global queue, Section 9 discusses some dRAP optimization
techniques inspired by the immune system, Section 10 presents experiments and
results, Section 11 discusses related work in this area, and Section 12
presents concluding remarks and outlines future work.


3 Statement of Problem and Assumptions 


Assume there is a queue Q of processes waiting to be allocated to processors.
Each process is required to declare a priori its resource requirements, viz. the
number of threads into which it can be parallelized (THn) and the number of
system resources it requires (the number of CPUs, CPUreq, is assumed to be
equal to the number of threads which can be run in parallel). Our system departs
from traditional resource allocation techniques in that there is no centralized
dispatcher. Instead, we dynamically organize a system of geographically
distributed computers into clusters to service each process in Q. Over time,
clusters of computers are dynamically created, dissociated and created again in
order to serve the resource requirements of the processes in Q. We define a
cluster as a network of computers which together can completely service the
resource requirements of a single process. Clusters are formed from computers
that are proximal to each other in order to reduce latency and communication
costs.

We acknowledge the following assumptions in our system: 


1. Distributed computers can communicate with each other. 

2. There are advantages to computing with geographically proximal computers 
due to network latency and bandwidth limitations. 

3. A new process Pi that comes in the system will declare a priori the number 
of threads that it can be parallelized into and its resource requirements (e.g. 
the number of CPUs it will require, I/O devices required, amount of memory, 
etc). 

4. The approach will become viable in the asymptotic region of millions or 
billions of geographically dispersed computers, when there will be expected 
benefits from a decentralized computing approach that exploits geographical 
proximity and reduces latency costs, as opposed to a centralized monitor. 






4 Decentralized Computing 


The extreme size of the computing grid and an ever-increasing demand for
computational power place exacting demands on any scheduling, allocation
and load-balancing algorithm. Here we argue that a decentralized computing
paradigm presents an ideal solution to the bottlenecks and single-point failures
inherent to a centralized monitor tasked with allocating resources and balancing
loads in the grid:

1. The workload assigned to a centralized monitor increases as computers are 
added to the computing grid. A decentralized approach can alleviate the 
computing load on monitors. In this approach, each individual computer, or 
cluster of computers, will do some computation. 

2. A centralized monitor makes the system susceptible to single-point failures.
Distributing load balancing and resource allocation tasks to individual
computers will increase system robustness.

3. Individual computing nodes are naturally aware of their own workloads. As
a result, the decentralized paradigm can achieve application-level resource
management with significantly less communication overhead than a centralized
monitor.

4. A decentralized system uses peer-to-peer networking to scale communication
as the system grows, whereas a centralized monitor has to communicate with
an increasing number of nodes.

5. A decentralized system is more robust to single node disruptions and failures,
whether malicious or benign.

6. A decentralized system may be able to respond better to fluctuations in
process requirements, e.g. in a scenario where there is no locality in process
requirements and the scheduler has to “forget” past process requirements and
completely rebuild new clusters after servicing each process.

5 Multi-Agent Systems 

Multi-agent systems use distributed agents to either model or solve a problem.
An agent is an entity which matches some real-world object. It could be a
biological cell, a virus particle, an ant or, in our case, an individual computer. A
computer program encodes simple rules or behaviors for interacting with other
agents. The agents move about in space and interact with other agents in their
neighborhood according to the encoded rules. Thus the behavior of low-level
entities is specified, and high-level behaviors evolve as simulation time progresses.
Multi-agent systems emphasize local interactions based on first principles, and
these interactions give rise to the complex high-level emergent properties of
interest. Such systems have been used to model biological phenomena such as the
human immune system, as well as to solve real-world problems like communication
between distributed radar transmitters and efficient resource collection
in swarms of foraging robots.




There is no centralized dispatcher to facilitate the formation and dissociation
of clusters in the proposed dRAP algorithm. Instead, the algorithm relies on the
emergent properties of a multi-agent system. A multi-agent or agent-based
system is an architecture in which the global properties of the system emerge
from local interactions.

The concept of a decentralized system presents a powerful counterpoint to 
the more common centralized control model often seen in business, government, 
and military organizations. Decentralization provides a number of important 
advantages over closed systems, such as robustness, adaptability, flexibility,
innovation, and distributed intelligence. The key to this compelling architecture
is the impressive ability of a decentralized system to react, mutate, or grow in 
response to challenging situations. 

In any such decentralized system, the agent represents the base unit of
computing power for the system. It behaves according to very simple rules. At each
unit of model time (or time step), the agent senses its immediate local
environment and takes actions based on its encoded rules. One rule might instruct the
agent to divide if the number of neighbors is greater than 3, while another would
cause it to die and be removed from the simulation if the number of neighbors is
less than 2. These two examples are in the style of the rules of the “Game of
Life”, a paradigmatic system where complex patterns arise from local
interactions and simple rules.
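
As a minimal illustration of how purely local rules drive a global pattern, the following Python sketch advances a Game of Life grid by one time step. It uses the standard Conway rules (birth on exactly three live neighbours, survival on two or three), which differ slightly from the illustrative rules quoted above; the function name `life_step` and the set-of-coordinates representation are assumptions of this sketch, not part of the paper.

```python
from collections import Counter

def life_step(live_cells):
    """Advance Conway's Game of Life by one time step.

    live_cells is a set of (x, y) coordinates of live agents.
    """
    # Every live cell contributes one count to each of its 8 neighbours,
    # so the counter maps a position to its number of live neighbours.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Standard rules: birth on exactly 3 neighbours, survival on 2 or 3.
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }

blinker = {(0, 1), (1, 1), (2, 1)}  # horizontal bar of three agents
```

Applying `life_step` to `blinker` turns the horizontal bar into a vertical one, and a second application restores it: an oscillation that no single agent's rule mentions explicitly.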

If we recast each agent’s local sensing functionality as a peer-to-peer
communication protocol with other nearby agents, then we can define a new set of rules
for each agent that induce actions based on the state of these other, neighboring
agents. Using this localized communication scheme, such rule-action pairs can be
viewed as instructions for individual agents that produce decentralized
computation across the system. There is no centralized monitor and yet this system is
capable of performing complex computations. In fact, the computational power 
of such a system of distributed agents acting on simple rules has been proven to 
be Turing-complete [S]. 

We use such an agent-based system to dynamically create and dissociate 
clusters based on the resource requirements of each process. A snapshot of this
system is shown in Figures 1 and 2.


6 Software Platform 

For this project we utilize the multi-agent simulation toolkit MASON [18].
MASON consists of a fairly small and portable set of Java library files that provide
for design of both model (the “algorithm” component) and visualization (the 
“graphical user interface” component). 

The agent, the base component of computation in MASON (as in any multi-agent
system), is coded in the familiar object-oriented programming format: the
class “Agent” contains all generalized methods and parameters needed for
the object “agent”, which is simply an instantiation of the Agent class. Following
this format, each instantiated agent may contain a unique set of parameters,
thereby allowing for minor variation in the replicated objects.

Fig. 1. Agents in large clusters; 1 free agent

Fig. 2. Agents in several smaller clusters

Agents are allowed to make decisions (and even communicate with one another)
in a randomized batch lock-step. That is, the MASON scheduler moves
through the (randomized) queue of all agents at each time step of the
simulation. Scheduling of agents continues as long as the simulation itself is running,
although the user may interrupt at any point by pausing or stopping the model. 
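
The randomized batch lock-step described above can be sketched in a few lines of Python. This is not MASON's API (MASON is a Java toolkit); the class `CountingAgent`, the function `run_lock_step` and the `seed` parameter are hypothetical names used only to illustrate the idea that every agent is stepped exactly once per time step, in a freshly shuffled order.

```python
import random

class CountingAgent:
    """Hypothetical agent that only records the time steps at which it ran."""

    def __init__(self, name):
        self.name = name
        self.steps_seen = []

    def step(self, time_step):
        self.steps_seen.append(time_step)

def run_lock_step(agents, num_steps, seed=0):
    """Step every agent once per time step, in a new random order each time."""
    rng = random.Random(seed)
    orders = []
    for t in range(num_steps):
        order = list(agents)
        rng.shuffle(order)  # new random ordering for this time step
        for agent in order:
            agent.step(t)
        orders.append([a.name for a in order])
    return orders

agents = [CountingAgent(i) for i in range(5)]
orders = run_lock_step(agents, num_steps=3)
```

Shuffling per time step avoids giving any agent a systematic advantage from always acting first, while the lock-step guarantees that no agent runs twice before another has run once.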

MASON in particular was selected because of its all-in-one toolkit approach, 
making multi-agent simulation much easier than if done from scratch, as well as 
the authors’ familiarity and experience with the MASON system. 

7 dRAP Algorithm 

The distributed resource allocation protocol (dRAP) is described below, and some
intended optimizations are suggested for future work. An agent in our system is
simply a computer. Each agent has a vector containing the time remaining to
finish executing its current process (timerem) and the number of CPUs in its
current cluster (CPUcluster). Each agent (or node) is guaranteed to be in exactly
1 of 4 modes (or states) during the simulation:

Mode 1: An agent/node that is currently not part of a cluster and has no task
assigned to it

 1. The agent scans the queue Q, considers the resource requirements CPUreq
 of unallocated tasks, and takes on the task which minimizes the expression
 |CPUreq - 1|.

Mode 2: An agent/node that is currently not part of a cluster and has a task
assigned to it

 1. The agent continues executing the task and updates its information
 vector (timerem, CPUcluster).

 2. If the task requirements are not completely satisfied (i.e. if CPUreq > 1),
 the agent will query its neighbors and attempt to form a cluster such
 that CPUreq = CPUcluster.

 3. When the agent finishes executing the task, it returns to Mode 1.

Mode 3: An agent/node that is currently part of a cluster and has no task
assigned to it

 1. The agent scans the queue Q, considers the unallocated tasks, and takes
 on the task which minimizes the expression |CPUreq - CPUcluster|.

Mode 4: An agent/node that is currently part of a cluster and has a task
assigned to it

 1. The agent continues executing the task and updates its information
 vector (timerem, CPUcluster).

 2. When the task completes, the agent dissociates from the cluster and
 returns to Mode 1.

A key feature of our algorithm is that nodes query their neighbors (other 
nodes that are close to them physically) in order to form clusters. This has the 
effect of reducing latency and communication costs. One optimization to consider 
would be to delay cluster dissociation in Mode 4. This would lead to learning 
or memory in the system where the scheduler would be able to remember past 
process requirements. 
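
The queue scan shared by Mode 1 and Mode 3 can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `Task` record, the function `pick_task` and the tie-breaking rule (the task nearest the front of the queue wins) are assumptions of the sketch. A free agent is treated as a cluster of size 1, so the Mode 1 criterion |CPUreq - 1| is a special case of the Mode 3 criterion |CPUreq - CPUcluster|.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: int
    cpu_req: int   # CPUreq, declared a priori
    time_rem: int  # time remaining to finish the task

def pick_task(queue, cpu_cluster):
    """Remove and return the task minimizing |CPUreq - CPUcluster|.

    A free agent (Mode 1) passes cpu_cluster=1; an idle cluster (Mode 3)
    passes its current size. Ties go to the task nearest the front of
    the queue, which is an assumption of this sketch.
    """
    if not queue:
        return None
    best = min(queue, key=lambda task: abs(task.cpu_req - cpu_cluster))
    queue.remove(best)
    return best

queue = [Task(0, 4, 100), Task(1, 2, 50), Task(2, 1, 25), Task(3, 2, 50)]
mode1_choice = pick_task(queue, cpu_cluster=1)  # free agent
mode3_choice = pick_task(queue, cpu_cluster=3)  # idle cluster of 3 nodes
```

Here the free agent takes the single-CPU task, while the three-node cluster finds no exact match and settles for the closest remaining task, after which it would have to grow or shed nodes as in Mode 2.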


8 Analysis of Queue Cost 

The dRAP algorithm requires a traversal through the global task queue in Mode
1 and Mode 3. The algorithmic complexity is given by
$\sum_{i} (n - i)\,m = O(n^2 m)$, where m is the number of tasks in the global
task queue and n is the average number of clusters. At a given timestep, the
worst case can be approximated as O(nm).

9 Optimizations Inspired by the Immune System 

The immune system is able to find rare, spatially localized pathogens and
eliminate them in a timely manner. Similar to how clusters of computers find
processes in our system, the immune system uses specialized cells to find pathogens
in anatomical regions called lymph nodes. In previous work we showed how a
sub-modular arrangement of lymph nodes could lead to fast elimination of pathogens
in the immune system and also faster search for solutions in immune-inspired
distributed systems of computers. Let an artificial lymph node be composed
of a number of clusters and a process queue. Also let there be a number
of such artificial lymph nodes that have the capability of communicating with
each other. An “artificial lymph node” is a computer in charge of
a number of clusters. This computer stores the process queue and also
has some memory and CPU to communicate with other “lymph nodes”.




We are interested in making the system sub-modular so that we can minimize
the total time to find a cluster. There is a tradeoff between the local cost and the
global cost: the local cost is O(n^2) and the global cost is O(N/n). The total cost
of traversing the queue in a lymph node and of communicating
with other lymph nodes can be summed up as:

$$t_{total} = t_{local} + t_{global} \qquad (1)$$

$$t_{total} = O(n^2) + O(N/n) \qquad (2)$$

where n is the number of clusters in a single lymph node and N is the
total number of clusters in the complete system. We assume that the global cost
of finding another cluster in another lymph node that can service some process
requirement is proportional to the number of lymph nodes (where N/n is the
number of lymph nodes in the system).

Minimizing the total time cost with respect to n, we get 2n - N/n^2 = 0, so

$$n = O(N^{1/3}) \qquad (3)$$

This implies that in larger systems (more computers, more clusters and more
lymph nodes), the number of clusters within a single lymph node should grow,
but only sub-linearly in the total number of clusters in the system. This
balances the local cost of queue traversal against the global cost of finding
another lymph node with a cluster that can service the process. For example, if
a system of networked artificial lymph nodes were to grow 1000 times bigger
(1000 times more clusters), then the number of clusters within a lymph node
need only increase by a factor of 10. Such sub-modular systems inspired by the
immune system have been proposed previously for mobile ad-hoc networks, control
of mobile robots, intrusion detection systems and peer-to-peer networks.
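
The cube-root scaling can be checked numerically. The sketch below assumes unit constants in both cost terms (the O() notation leaves them unspecified), so the exact minimizer is n = (N/2)^(1/3); the function names `total_cost` and `best_cluster_count` are hypothetical and chosen only for this illustration.

```python
def total_cost(n, N):
    """Local queue-traversal cost n^2 plus global search cost N/n (unit constants assumed)."""
    return n**2 + N / n

def best_cluster_count(N):
    """Brute-force the integer n in [1, N] that minimizes total_cost."""
    return min(range(1, N + 1), key=lambda n: total_cost(n, N))

small = best_cluster_count(2_000)      # closed form: (2000 / 2) ** (1/3) = 10
large = best_cluster_count(2_000_000)  # closed form: (2e6 / 2) ** (1/3) = 100
```

Under these assumptions, a 1000-fold increase in N moves the best number of clusters per lymph node from 10 to 100, i.e. a factor of 10.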

More generally, if the local and global communication costs scale with
exponents α and β, and γ is the exponent on N in the global cost, we have

$$t_{total} = O(n^{\alpha}) + O(N^{\gamma}/n^{\beta}) \qquad (4)$$

Minimizing the expression with respect to n, we get

$$n = O(N^{\gamma/(\alpha+\beta)}) \qquad (5)$$

1. If γ < α + β we have sub-linear scaling.

2. If γ > α + β we have super-linear scaling.

3. If γ = α + β we have linear scaling.

4. If γ/(α + β) = 0 we have no scaling (constant).

5. If γ/(α + β) < 0 we have negative scaling.
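
For completeness, the intermediate steps behind the general scaling exponent can be written out as below. The derivation assumes α > 0 and β > 0 and unit constants in both cost terms; these assumptions are ours and only affect the constant prefactor, not the exponent.

```latex
\begin{align*}
t_{total}(n) &= n^{\alpha} + \frac{N^{\gamma}}{n^{\beta}} \\
\frac{d\,t_{total}}{d n} &= \alpha\, n^{\alpha-1} - \beta\, \frac{N^{\gamma}}{n^{\beta+1}} = 0 \\
n^{\alpha+\beta} &= \frac{\beta}{\alpha}\, N^{\gamma} \\
n &= \left(\frac{\beta}{\alpha}\right)^{\frac{1}{\alpha+\beta}} N^{\frac{\gamma}{\alpha+\beta}}
   = O\!\left(N^{\frac{\gamma}{\alpha+\beta}}\right)
\end{align*}
```

Setting α = 2, β = 1 and γ = 1 recovers the special case n = O(N^{1/3}) obtained from the O(n^2) local cost and O(N/n) global cost.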




10 Experiments 


We conduct several experiments that compare our dRAP algorithm to a null
model, i.e. a first-in first-out (FIFO) scheduling system. Additionally, we
measure the effective computational complexity of queue traversals and examine the
scaling properties of our system by varying the number of nodes and measuring
the effect on performance. We define two timing metrics on which our system
performance will be judged: Tcomplete is the time required to complete all tasks
in the queue, and Twait is the average wait time for a task added to the queue.
Unless otherwise noted, system parameters are defined as follows: number of nodes
= 100, number of tasks = 1000, tasks are randomly selected from a normal
distribution such that CPUreq varies from 1 to 5, with initial timerem varying from 25
to 125 in increments of 25. That is, a task ti with CPUreq = 1 has an initial
timerem = 25, and a task tj with CPUreq = 5 has an initial timerem = 125. All
averages are computed across 10 trials.


10.1 Comparison to Null Model 

Here we present three separate experiments which compare the dRAP and FIFO
algorithms. The first is a simple timing comparison that looks at Tcomplete and
Twait for each case. Values are presented in Table 1 (including 95% confidence
intervals).

        Tcomplete                       Twait
dRAP    845.60 (829.26, 861.94)         342.54 (335.78, 349.30)
FIFO    1071.20 (1053.41, 1088.99)      475.31 (464.82, 485.79)

Table 1. Average timing comparison of dRAP and FIFO scheduling algorithms with
95% confidence intervals.


We observe a reduction of approximately 21% in Tcomplete and approximately
28% in Twait when comparing dRAP to FIFO.
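
The relative reductions can be recomputed directly from the mean values reported in Table 1. The helper name `percent_reduction` is our own; the four numbers are the table's means.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Mean values taken from Table 1 (FIFO is the baseline).
t_complete_reduction = percent_reduction(1071.20, 845.60)
t_wait_reduction = percent_reduction(475.31, 342.54)
```

Rounded to whole percentages, these come to about 21% for Tcomplete and about 28% for Twait.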

Our second experiment comparing dRAP and FIFO involves average cluster
utilization. Because dRAP assigns tasks such that CPUcluster == CPUreq, all
nodes in the cluster are guaranteed to be utilized. However, the FIFO scheduling
system hands out tasks to the first available cluster, meaning it allows for the
possibility that CPUcluster > CPUreq. For example, a task with CPUreq = 2
that is assigned to a cluster with CPUcluster = 5 will leave 3 unused nodes.
Thus, we present an analysis of cluster utilization using the metric in Equation
6:

$$\mu_{cluster} = \frac{CPU_{req}}{CPU_{cluster}} \qquad (6)$$

If CPUcluster < CPUreq, we simply set μ_cluster = 1. Values are presented as
percentages in Table 2 (note that dRAP’s μ_cluster is always 100% by definition).

        μ_cluster
dRAP    100%
FIFO    56% (54%, 58%)

Table 2. Average cluster utilization of dRAP and FIFO scheduling algorithms with
95% confidence intervals.
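
The utilization metric of Equation 6, together with its cap at 1 for undersized clusters, can be sketched as a small function. The name `cluster_utilization` and its parameter names are assumptions of this sketch.

```python
def cluster_utilization(cpu_req, cpu_cluster):
    """Fraction of a cluster's nodes used by a task, capped at 1."""
    if cpu_cluster < cpu_req:
        return 1.0  # cluster smaller than the request: every node is busy
    return cpu_req / cpu_cluster

# The example from the text: a 2-CPU task on a 5-node cluster leaves 3 nodes idle.
example = cluster_utilization(cpu_req=2, cpu_cluster=5)
```

For the example in the text this gives a utilization of 0.4, i.e. 40% of the cluster is doing useful work.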

Finally, our third experiment is designed to measure global node utilization 
over the time of the simulation. Here we simply document the number of nodes 
that do computation on a given timestep and normalize by the total number 
of nodes in the system. Results are displayed in Figure 3 (taken from a single
simulation run).


Fig. 3. Global utilization of nodes throughout simulation (plot title: Global
Node Usage; series: dRAP and FIFO)


We observe that the dRAP algorithm utilizes approximately 90-95% of the
nodes for the majority of the simulation, while FIFO utilizes approximately
70-75%.


10.2 Effective Complexity 

For this experiment, we estimate the “effective” computational complexity of the
dRAP algorithm. That is, in comparison to the qualitative O(nm) worst case
runtime per timestep, we are interested in how much of the task queue must be
traversed in order to properly fit the CPUcluster == CPUreq requirement. Total
tasks traversed per timestep from one selected simulation run are presented in
Figure 4. Note that the initial traversal (timestep “0”), although difficult to see,
is approximately 11,000.

Fig. 4. Total task traversals by all clusters per timestep (plot title: Queue
Traversal).


“Worst case” here, as addressed above, is O(nm), or 100,000 tasks traversed
per timestep if n = number of clusters = number of nodes = 100 and m = number
of tasks = 1000. From this plot (plus additional runs not included here), we can
conclude that effective computational complexity is no more than approximately
10% of the worst case runtime O(nm).


10.3 Scaling 

For our last experiment, we are interested in collecting information on the scaling
ability of our algorithm. Our goal in this test is to increase the number of nodes
(in intervals of 50), while also maintaining an equal number of neighbors for
each node. That is, we ensure that the neighborhood size parameter defined in
the simulation scales inversely with the number of nodes such that a given node has
approximately the same number of neighbors regardless of the total nodes in the
system. Results are presented for our two timing metrics: Tcomplete scaling in
Figure 5 and Twait scaling in Figure 6. Data in both figures are log2-transformed
in order to correlate doubling of nodes with halving of the timing metrics.

We note a near-perfect scaling for both timing metrics, as shown in the fitted
power law equations inset into each plot (Figures 5 and 6). Note that the Twait
exponent above 1 is most likely a result of inexact tuning of the neighborhood
size with increasing nodes, and this issue will be rectified in future work.
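
A power law T = c * nodes^k becomes a straight line after a log2 transform, so its exponent can be estimated with an ordinary least-squares fit. The sketch below shows one way to do this; the function `fit_power_law` is our own, and the data are hypothetical values constructed to scale ideally (they are not the paper's measurements).

```python
import math

def fit_power_law(nodes, times):
    """Least-squares fit of log2(T) = log2(c) + k * log2(nodes); returns (c, k)."""
    xs = [math.log2(n) for n in nodes]
    ys = [math.log2(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return 2**intercept, slope

# Hypothetical data with ideal scaling: doubling the nodes halves the time.
nodes = [50, 100, 200, 400]
times = [8000 / n for n in nodes]
c, k = fit_power_law(nodes, times)
```

An exponent k of -1 corresponds to perfect scaling; a magnitude above 1, as reported for Twait, means the metric falls faster than the node count grows.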
Fig. 5. Scaling of Tcomplete

Fig. 6. Scaling of Twait


11 Related Work 


Resource allocation for grid computing is an active area of research. For example, 
SLURM [20] is a configurable Linux utility for cluster allocation that uses static 
allocation of nodes to clusters, called partitions, in contrast to the dynamic 
cluster allocation presented in this paper. LSF [21] is another cluster
management facility; it is proprietary, and details of its allocation algorithm
are not publicly available.


12 Conclusions and Future Work 

In this paper we have presented an algorithm for allocating, scheduling and
load-balancing processes on a massively distributed system. This is very relevant to
current research in operating systems, especially with the trend of moving
computation tasks onto inexpensive, distributed hardware. The proposed decentralized
algorithm draws inspiration from biology, adaptively creating and dissociating
clusters from nodes to match task demand. Decentralization enables scalability,
robustness, alleviation of computing load on the monitor, better response and
adaptability to process queue fluctuations, and learning about process
requirements. The dRAP algorithm outperforms a FIFO scheduling algorithm on time
to complete all tasks, average waiting time and CPU utilization. The scheduling
is also shown to be robust to a malicious adversary that might permute the
order of the tasks such that high demand tasks would be queued first, followed by
low demand tasks. A key feature of our procedure is that nodes communicate
with neighboring computers in order to dynamically form clusters. Hence our
algorithm also holds promise in areas where it is advantageous to communicate
with immediate neighbors due to network latency, e.g. Google MapReduce uses
a locality optimization to reduce latency due to network communication [5]. The
comparison of this algorithm to other scheduling algorithms like SRTF (Shortest
Remaining Time First) on other metrics like response time, as well as collection
of data on the exact distribution of process demand in a queue in a real-world
scenario, will be the subject of future investigation.